Analysis of a Classification-based Policy Iteration Algorithm
نویسندگان
چکیده
We present a classification-based policy iteration algorithm, called Direct Policy Iteration, and provide its finite-sample analysis. Our results state a performance bound in terms of the number of policy improvement steps, the number of rollouts used in each iteration, the capacity of the considered policy space, and a new capacity measure which indicates how well the policy space can approximate policies that are greedy w.r.t. any of its members. The analysis reveals a tradeoff between the estimation and approximation errors in this classification-based policy iteration setting. We also study the consistency of the method when there exists a sequence of policy spaces with increasing capacity.
منابع مشابه
Analysis of Classification-based Policy Iteration Algorithms Analysis of Classification-based Policy Iteration Algorithms
We introduce a variant of the classification-based approach to policy iteration which uses a cost-sensitive loss function weighting each classification mistake by its actual regret, i.e., the difference between the action-value of the greedy action and of the action chosen by the classifier. For this algorithm, we provide a full finite-sample analysis. Our results state a performance bound in t...
متن کاملAnalysis of a Classification-based Policy Iteration Algorithm
We present a classification-based policy iteration algorithm, called Direct Policy Iteration, and provide its finite-sample analysis. Our results state a performance bound in terms of the number of policy improvement steps, the number of rollouts used in each iteration, the capacity of the considered policy space, and a new capacity measure which indicates how well the policy space can approxim...
متن کاملAnalysis of Classification-based Policy Iteration Algorithms
We introduce a variant of the classification-based approach to policy iteration which uses a cost-sensitive loss function weighting each classification mistake by its actual regret, that is, the difference between the action-value of the greedy action and of the action chosen by the classifier. For this algorithm, we provide a full finite-sample analysis. Our results state a performance bound i...
متن کاملApproximate modified policy iteration and its application to the game of Tetris
Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are exten...
متن کاملConservative and Greedy Approaches to Classification-Based Policy Iteration
The existing classification-based policy iteration (CBPI) algorithms can be divided into two categories: direct policy iteration (DPI) methods that directly assign the output of the classifier (the approximate greedy policy w.r.t. the current policy) to the next policy, and conservative policy iteration (CPI) methods in which the new policy is a mixture distribution of the current policy and th...
متن کامل